Analysis of Fifa World Cup Tournaments (1930-2022)¶

Bethel Biniam, Fall 2022¶

image.png

Introduction¶

Motivation¶

Every four years, the FIFA World Cup is held and overtakes the media for weeks. In fact, it is due to end in just a few days time. As someone who enjoys it casually, just watching for the fun of the game, I was excited by the possible data analyses that could be performed on the data. What can be said about the playing teams? Are there any trends noticable? How can I examine these things?

The World Cup of 2022 has been shocking in more ways than one: dark horses like Japan and South Korea making it far into the tournament, Morocco being the first African team to make it to semi finals, the heated match between the Netherlands and Argentina team (shown in the image above), just to name a few.

As a computer science student, I thought the hype around the world cup would lend itself well to making my report that much more interesting. Follow along with me as I explore the trends of performance in the World Cup over the years, making discoveries you may expect, and some you may not. We will look at how games played out in this tournament, as well as those in previous years, all the way back to 1930.

Considering that the World Cup is not yet done as of the finalization of this report (December 16th), it is exclusive of the final games.

This project consists of a few sections:

  1. Data Collection
  2. Data Management and Cleaning
  3. Exploratory Data Analysis
  4. Hypothesis Testing
  5. Communication of Insights

Required Libraries/Tools¶

In order to perform data analysis on this dataset, I need the help of some python packages. The following will be used throughout the project. They will be used for the following purposes:

  1. pandas: used to manipulate dataframes
  2. numpy: used to calculate statistics
  3. matplotlib: used to create plots
  4. sklearn: used to model regression
  5. statsmodels: used to perform hypothesis testing
  6. pycountry: usedto translate country names to continent names. I import a specific module, pycountry_convert to do this.
    • In order to use this package, you will have to install it on your machine. If you're using Jupyter Notebooks like me, that can be done like below, with ! pip install pycountry_convert
In [502]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
! pip install pycountry_convert
import pycountry_convert as pc
from sklearn.linear_model import LinearRegression
Requirement already satisfied: pycountry_convert in /opt/conda/lib/python3.9/site-packages (0.7.2)
Requirement already satisfied: repoze.lru>=0.7 in /opt/conda/lib/python3.9/site-packages (from pycountry_convert) (0.7)
Requirement already satisfied: pytest-mock>=1.6.3 in /opt/conda/lib/python3.9/site-packages (from pycountry_convert) (3.10.0)
Requirement already satisfied: wheel>=0.30.0 in /opt/conda/lib/python3.9/site-packages (from pycountry_convert) (0.37.1)
Requirement already satisfied: pprintpp>=0.3.0 in /opt/conda/lib/python3.9/site-packages (from pycountry_convert) (0.4.0)
Requirement already satisfied: pytest-cov>=2.5.1 in /opt/conda/lib/python3.9/site-packages (from pycountry_convert) (4.0.0)
Requirement already satisfied: pytest>=3.4.0 in /opt/conda/lib/python3.9/site-packages (from pycountry_convert) (7.2.0)
Requirement already satisfied: pycountry>=16.11.27.1 in /opt/conda/lib/python3.9/site-packages (from pycountry_convert) (22.3.5)
Requirement already satisfied: setuptools in /opt/conda/lib/python3.9/site-packages (from pycountry>=16.11.27.1->pycountry_convert) (59.8.0)
Requirement already satisfied: packaging in /opt/conda/lib/python3.9/site-packages (from pytest>=3.4.0->pycountry_convert) (21.3)
Requirement already satisfied: attrs>=19.2.0 in /opt/conda/lib/python3.9/site-packages (from pytest>=3.4.0->pycountry_convert) (21.4.0)
Requirement already satisfied: exceptiongroup>=1.0.0rc8 in /opt/conda/lib/python3.9/site-packages (from pytest>=3.4.0->pycountry_convert) (1.0.4)
Requirement already satisfied: tomli>=1.0.0 in /opt/conda/lib/python3.9/site-packages (from pytest>=3.4.0->pycountry_convert) (1.2.2)
Requirement already satisfied: iniconfig in /opt/conda/lib/python3.9/site-packages (from pytest>=3.4.0->pycountry_convert) (1.1.1)
Requirement already satisfied: pluggy<2.0,>=0.12 in /opt/conda/lib/python3.9/site-packages (from pytest>=3.4.0->pycountry_convert) (1.0.0)
Requirement already satisfied: coverage[toml]>=5.2.1 in /opt/conda/lib/python3.9/site-packages (from pytest-cov>=2.5.1->pycountry_convert) (6.5.0)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/lib/python3.9/site-packages (from packaging->pytest>=3.4.0->pycountry_convert) (3.0.7)

About the Dataset¶

I will be using the FIFA Football World Cup dataset in this report.

It has a few separate datasets depending on whether you want to view data from the previous tournaments (1930-2018), the current one, or general data about the teams. I will be combining two of the provided datasets: matches_1930_2018.csv and matches_2022.csv.

Though there are many fields in this dataset, I will be narrowing them down to the ones I'm interested in analysing:

  • home_team, away_team - Home and Away Teams,
  • home_score, away_score - Scores,
  • home_xg, away_xg - XG (Expected Goals): this will be expanded on in a later section
  • home_penalty, away_penalty - Penalties,
  • home_yellow_card_long, away_yellow_card_long - Yellow Cards Given,
  • Date,
  • Score,
  • Year

Extra Readings/Viewings¶

If you want to learn more about the workings of the Fifa World Cup tournaments to better understand the analyses going forward, check out these links:

  1. What is Expected Goals in FIFA 22? How xG is calculated
  2. Fifa World Cup: Wikipedia
  3. What happens if a knockout game ends in a tie? World Cup extra time and penalty shootouts, explained

1. Data Collection¶

Using the read_csv function of pandas, I extract the data from both datasets and into a dataframe. This makes the data analysis process extremely simple. If you are interested in learning more about dataframes in pandas, check out the documentation.

After both dataframes are formed, one for past tournaments and one for the current tournament, I will start cleaning the data.

In [503]:
# Dataset of games from 1930 to 2018
fifa_past = pd.read_csv("matches_1930_2018.csv")
fifa_past
Out[503]:
home_team away_team home_score home_xg home_penalty away_score away_xg away_penalty home_manager home_captain ... home_penalty_shootout_miss_long away_penalty_shootout_miss_long home_red_card away_red_card home_yellow_red_card away_yellow_red_card home_yellow_card_long away_yellow_card_long home_substitute_in_long away_substitute_in_long
0 France Croatia 4 1.1 NaN 2 1.1 NaN Didier Deschamps Hugo Lloris ... NaN NaN NaN NaN NaN NaN ["27&rsquor;|1:0|N'Golo Kanté", '41&rsquor;|2:... ['90+2&rsquor;|4:2|Šime Vrsaljko'] ["54&rsquor;|2:1|Steven Nzonzi|for N'Golo Kant... ['71&rsquor;|4:2|Andrej Kramarić|for Ante Rebi...
1 Belgium England 2 1.5 NaN 0 1.0 NaN Roberto Martínez Eden Hazard ... NaN NaN NaN NaN NaN NaN ['90+3&rsquor;|2:0|Axel Witsel'] ['52&rsquor;|1:0|John Stones', '77&rsquor;|1:0... ['39&rsquor;|1:0|Thomas Vermaelen|for Nacer Ch... ['46&rsquor;|1:0|Jesse Lingard|for Danny Rose'...
2 Croatia England 2 1.7 NaN 1 0.6 NaN Zlatko Dalić Luka Modrić ... NaN NaN NaN NaN NaN NaN ['48&rsquor;|0:1|Mario Mandžukić', '96&rsquor;... ['54&rsquor;|0:1|Kyle Walker'] ['95&rsquor;|1:1|Josip Pivarić|for Ivan Strini... ['74&rsquor;|1:1|Marcus Rashford|for Raheem St...
3 France Belgium 1 1.7 NaN 0 0.4 NaN Didier Deschamps Hugo Lloris ... NaN NaN NaN NaN NaN NaN ["87&rsquor;|1:0|N'Golo Kanté", '90+3&rsquor;|... ['63&rsquor;|1:0|Eden Hazard', '71&rsquor;|1:0... ['85&rsquor;|1:0|Steven Nzonzi|for Olivier Gir... ['60&rsquor;|1:0|Dries Mertens|for Mousa Dembé...
4 Sweden England 0 0.5 NaN 2 1.0 NaN Janne Andersson Andreas Granqvist ... NaN NaN NaN NaN NaN NaN ['88&rsquor;|0:2|John Guidetti', '90+4&rsquor;... ['87&rsquor;|0:2|Harry Maguire'] ['65&rsquor;|0:2|John Guidetti|for Ola Toivone... ['77&rsquor;|0:2|Fabian Delph|for Dele Alli', ...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
895 Argentina France 1 NaN NaN 0 NaN NaN Francisco Olazar Manuel Ferreira ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
896 Yugoslavia Brazil 2 NaN NaN 1 NaN NaN Bosko Simonovic Milutin Ivković ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
897 Romania Peru 3 NaN NaN 1 NaN NaN Octav Luchide Emerich Vogl ... NaN NaN NaN Plácido Galindo · 70 NaN NaN NaN NaN NaN NaN
898 United States Belgium 3 NaN NaN 0 NaN NaN Bob Millar Tom Florie ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
899 France Mexico 4 NaN NaN 1 NaN NaN Raoul Caudron Alexandre Villaplane ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

900 rows × 44 columns

In [504]:
# Dataset of games from the 2022 World Cup so far.
fifa_2022 = pd.read_csv("matches_2022.csv")
fifa_2022
Out[504]:
home_team away_team home_score home_xg home_penalty away_score away_xg away_penalty home_manager home_captain ... home_penalty_shootout_miss_long away_penalty_shootout_miss_long home_red_card away_red_card home_yellow_red_card away_yellow_red_card home_yellow_card_long away_yellow_card_long home_substitute_in_long away_substitute_in_long
0 Qatar Ecuador 0.0 0.3 NaN 2.0 1.2 NaN Félix Sánchez Hassan Al-Haydos ... NaN NaN NaN NaN NaN NaN ['15&rsquor;|0:0|Saad Al Sheeb', '22&rsquor;|0... ['29&rsquor;|0:1|Moisés Caicedo', '56&rsquor;|... ['71&rsquor;|0:2|Mohammed Waad|for Hassan Al-H... ['68&rsquor;|0:2|Jeremy Sarmiento|for Romario ...
1 England IR Iran 6.0 2.1 NaN 2.0 1.4 NaN Gareth Southgate Harry Kane ... NaN NaN NaN NaN NaN NaN NaN ['25&rsquor;|0:0|Alireza Jahanbakhsh', '48&rsq... ['70&rsquor;|4:1|Eric Dier|for Harry Maguire',... ['20&rsquor;|0:0|Hossein Hosseini|for Alireza ...
2 United States Wales 1.0 0.8 NaN 1.0 1.5 NaN Gregg Berhalter Tyler Adams ... NaN NaN NaN NaN NaN NaN ['11&rsquor;|0:0|Sergiño Dest', '13&rsquor;|0:... ['40&rsquor;|1:0|Gareth Bale', '45+2&rsquor;|1... ['66&rsquor;|1:0|Brenden Aaronson|for Weston M... ['46&rsquor;|1:0|Kieffer Moore|for Daniel Jame...
3 Senegal Netherlands 0.0 0.9 NaN 2.0 0.7 NaN Aliou Cissé Kalidou Koulibaly ... NaN NaN NaN NaN NaN NaN ['90+4&rsquor;|0:1|Nampalys Mendy', '90+6&rsqu... ['56&rsquor;|0:0|Matthijs de Ligt'] ['62&rsquor;|0:0|Ismail Jakobs|for Abdou Diall... ['62&rsquor;|0:0|Memphis|for Vincent Janssen',...
4 France Australia 4.0 4.0 NaN 1.0 0.5 NaN Didier Deschamps Hugo Lloris ... NaN NaN NaN NaN NaN NaN NaN ['55&rsquor;|2:1|Mitchell Duke', '80&rsquor;|4... ['13&rsquor;|0:1|Theo Hernández|for Lucas Hern... ['56&rsquor;|2:1|Jason Cummings|for Mitchell D...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
59 Morocco Portugal 1.0 1.4 NaN 0.0 0.9 NaN Hoalid Regragui Romain Saïss ... NaN NaN NaN NaN Walid Cheddira · 90+3 NaN ['70&rsquor;|1:0|Achraf Dari', '90+1&rsquor;|1... ['87&rsquor;|1:0|Vitinha'] ['57&rsquor;|1:0|Achraf Dari|for Romain Saïss'... ['51&rsquor;|1:0|João Cancelo|for Raphaël Guer...
60 Argentina Croatia 3.0 2.3 NaN 0.0 0.5 NaN Lionel Scaloni Lionel Messi ... NaN NaN NaN NaN NaN NaN ['68&rsquor;|2:0|Cristian Romero', '71&rsquor;... ['32&rsquor;|0:0|Mateo Kovačić', '32&rsquor;|0... ['62&rsquor;|2:0|Lisandro Martínez|for Leandro... ['46&rsquor;|2:0|Mislav Oršić|for Borna Sosa',...
61 France Morocco 2.0 1.9 NaN 0.0 0.9 NaN Didier Deschamps Hugo Lloris ... NaN NaN NaN NaN NaN NaN NaN ['27&rsquor;|1:0|Sofiane Boufal'] ['65&rsquor;|1:0|Marcus Thuram|for Olivier Gir... ['21&rsquor;|1:0|Selim Amallah|for Romain Saïs...
62 Croatia Morocco NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
63 Argentina France NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

64 rows × 44 columns

2. Data Management and Cleaning¶

Data cleansing is a messy process of "identifying the incorrect, incomplete, inaccurate, irrelevant or missing part of the data and then modifying, replacing or deleting them according to the necessity. Data cleaning is considered a foundational element of the basic data science." It is an integral part of the Data Science Pipeline. To learn more, visit What is Data Cleaning? How to Process Data for Analytics and Machine Learning Modeling?

I cannot yet join the fifa_past and fifa_2022 dataframes, since I intend to perform some analysis on data from this year's tournament only in one case. Thus, the following cleansing will appear duplicated, since I perform it for both datasets.

Part 1: Converting Yellow Card Field to Numerical Value¶

Given the tension in this year's World Cup, especially between the Netherlands and Argentina, I was immediately interested in preserving this field to analyze in depth later. However, the dataset lists the yellow card attribute appears to be corrupted or in a form that could not be read properly.

There is some sense to it, though. It appears to be in list form, but given as a string. By parsing it and turning it into a list, then taking the length, we get the number of yellow cards given.

In [505]:
# For each past game...
for ind, game in fifa_past.iterrows():        

    # If the away team received Yellow Cards...
    if pd.notnull(game['away_yellow_card_long']): 
        
        # Strip the array brackets, and split on the comma character, then take the length of that list
        fifa_past.loc[ind, 'away_yellows'] = len(str((game['away_yellow_card_long'])).strip('][').split(', '))
    else: 
        # If there were no yellow cards given in this game, insert 0
        fifa_past.loc[ind, 'away_yellows'] = 0
        
    if pd.notnull(game['home_yellow_card_long']): # Same process for the home team
        fifa_past.loc[ind, 'home_yellows'] = len(str((game['home_yellow_card_long'])).strip('][').split(', '))
    else: 
        fifa_past.loc[ind, 'home_yellows'] = 0
        
    
# Same process as above, but for the 2022 dataset
for ind, game in fifa_2022.iterrows():
    if pd.notnull(game['away_yellow_card_long']): # away
        fifa_2022.loc[ind, 'away_yellows'] = len(str((game['away_yellow_card_long'])).strip('][').split(', '))
    else: 
        fifa_2022.loc[ind, 'away_yellows'] = 0
        
    if pd.notnull(game['home_yellow_card_long']): # home
        fifa_2022.loc[ind, 'home_yellows'] = len(str((game['home_yellow_card_long'])).strip('][').split(', '))
    else: 
        fifa_2022.loc[ind, 'home_yellows'] = 0
        

Part 2: Adding the Continent Field¶

The dataset in its current form provides us with a lot of interesting information, but we can step it up a notch. I am interested in finding out what trends may be prevalent when we look at the continent that each team is from. To that end, I use the pycountry_convery module installed in the beginning to convert team/country names to continent names. This must be done for both the home and away team.

However, not all team names are the name of the country from which they live. In those cases, pycountry_convert will throw a KeyError. For the time being, I will catch that error by placing 'TBD' (to be determined) in the field.

In [506]:
# Converts given team/country name to the continent that the team is from
def country_to_continent(country_name):
    country_a2 = pc.country_name_to_country_alpha2(country_name) # 2 letter code
    country_continent_code = pc.country_alpha2_to_continent_code(country_a2)
    country_continent_name = pc.convert_continent_code_to_continent_name(country_continent_code)
    return country_continent_name

# For each past game...
for ind, game in fifa_past.iterrows(): 
    
    # Try using pycountry_convert to provide the continent of that team, and assign it if possible
    try:
        fifa_past.loc[ind, 'home_continent'] = country_to_continent(game['home_team'])
    except KeyError:
        # Label teams not named after their country with TBD
        fifa_past.loc[ind, 'home_continent'] = 'TBD'
        
    # Repeating the process for the away team
    try:
        fifa_past.loc[ind, 'away_continent'] = country_to_continent(game['away_team'])
    except KeyError:
        fifa_past.loc[ind, 'away_continent'] = 'TBD'
        
    
# Same process as above for the 2022 dataset
for ind, game in fifa_2022.iterrows(): 
    try:
        fifa_2022.loc[ind, 'home_continent'] = country_to_continent(game['home_team'])
    except KeyError:
        fifa_2022.loc[ind, 'home_continent'] = 'TBD'
        
    try:
        fifa_2022.loc[ind, 'away_continent'] = country_to_continent(game['away_team'])
    except KeyError:
        fifa_2022.loc[ind, 'away_continent'] = 'TBD'
    

There are quite a few teams that aren't named after their country, or are named slightly differently, or may no longer exist at all! That is to be expected since the World Cup has been around since the 1930s. There are several teams and regions that no longer go by the names they used to.

For each of these teams, I personally assign them to the continent they belong, or belonged, to. I then go back through looking for every TBD field and assign the proper continent to the respective team.

In [507]:
# Mini dataframes for both home and away teams with TBD in the column field
home_tbd = fifa_past[fifa_past['home_continent'] == 'TBD']
away_tbd = fifa_past[fifa_past['away_continent'] == 'TBD']

# This is the list of teams from countries that no longer exist, or go by have different names
tbd = away_tbd['away_team'].unique() 

# Using the above variable, I assign each team to the continent they are from
conts = {'Europe': ['England', 'Wales', 'Scotland', 'Germany DR', 'Soviet Union', 'Yugoslavia', 'Czechoslovakia', 
                    'FR Yugoslavia', 'West Germany', 'Republic of Ireland', 'Northern Ireland', 'Serbia and Montenegro'], 
         'Asia': ['Korea Republic', 'Dutch East Indies', 'IR Iran', 'Korea DPR', 'Türkiye', 'China PR'], 'Africa': ['Zaire']}

# For each game...
for ind, game in fifa_past.iterrows():
    # If the continent is not listed for this team
    if game['home_continent'] == 'TBD':
        
        # Looking for the continent that has this team's name in it
        for c in conts:
            if game['home_team'] in conts[c]:
                fifa_past.loc[ind, 'home_continent'] = c  # Assigning the proper continent to the team
                
# Same process for the away TBD teams
for ind, game in fifa_past.iterrows():
    if game['away_continent'] == 'TBD':
        for c in conts:
            if game['away_team'] in conts[c]:
                fifa_past.loc[ind, 'away_continent'] = c
            

Below, the process is repeated to assign a continent to the team names that do not yet have one.

In [508]:
# Similar to above, but for the 2022 dataset
for ind, game in fifa_2022.iterrows():
    if game['home_continent'] == 'TBD':
        for c in conts:
            if game['home_team'] in conts[c]:
                fifa_2022.loc[ind, 'home_continent'] = c
                
for ind, game in fifa_2022.iterrows():
    if game['away_continent'] == 'TBD':
        for c in conts:
            if game['away_team'] in conts[c]:
                fifa_2022.loc[ind, 'away_continent'] = c
            

Part 3. Winner Column and Continent TODO explain continent everywhere too¶

The current dataset does not take into account the winner of any given game. I think it would be interesting to perform analysis on in the future, so we are going to create that column ourselves. The winner of a game will be calculated in one of three ways:

  1. The team that scored most points, wins
  2. In the event of a tie, and the game had a penalty shootout, the team with the most penalties wins
  3. In the event of a tie, and there was no penalty shootout, the game ended in a draw

Draws were used in the eliminatories to determine which teams in a group would advance to the round of the final 16. After making it into the final 16, there are no draws, and the stakes of the game increase.

In [509]:
# For each game...
for ind, game in fifa_past.iterrows():
    
    # If there is a tie...
    if game['home_score'] == game['away_score']:
        
        # If there were no penalty shootouts, the game was a DRAW
        if pd.isnull(game['home_penalty']):
            win_cont = 'DRAW'
            win = 'DRAW'
            
        # Else, the winning team is the team with the most penalties scored
        elif game['home_penalty'] > game['away_penalty']:
            win_cont = game['home_continent']
            win = game['home_team']
        else:
            win_cont = game['away_continent']
            win = game['away_team']
            
    # If there was no tie, the winning team is the team with the most points scored        
    else:
        if game['home_score'] > game['away_score']:
            win_cont = game['home_continent']
            win = game['home_team']
        else: 
            win_cont = game['away_continent']
            win = game['away_team']
            
    fifa_past.loc[ind, 'winner_cont'] = win_cont  # Assigning the name/cont of the winning team
    fifa_past.loc[ind, 'winner'] = win  
    
    
# Repeat for the 2022 dataset
for ind, game in fifa_2022.iterrows():
    if game['home_score'] == game['away_score']:
        if pd.isnull(game['home_penalty']):
            win_cont = 'DRAW'
            win = 'DRAW'
        elif game['home_penalty'] > game['away_penalty']:
            win_cont = game['home_continent']
            win = game['home_team']
        else:
            win_cont = game['away_continent']
            win = game['away_team']
            
    else:
        if game['home_score'] > game['away_score']:
            win = game['home_team']
            win_cont = game['home_continent']
        else: 
            win_cont = game['away_continent']
            win = game['away_team']
         
    fifa_2022.loc[ind, 'winner_cont'] = win_cont
    fifa_2022.loc[ind, 'winner'] = win
    

Part 4: Reducing Fields¶

The dataset came with 44 fields of data included, but we don't want to keep too many of those. I narrow down the fields to the ones I am most interested in, including the ones I created above. This includes scores, penalties, expected goals, year, and yellow card information.

In [520]:
# Columns I'm interested in
cols = ['home_team', 'away_team', 'home_score', 'home_xg', 'home_penalty', 'away_score', 'away_xg', 'away_penalty',  
        'Score', 'Year', 'home_yellow_card_long', 'away_yellow_card_long']

# Columns I created
created = ['away_yellows', 'home_yellows', 'home_continent', 'away_continent', 'winner', 'winner_cont']

# Removing the detailed descriptions of yellow cards
yellow_long = ['home_yellow_card_long', 'away_yellow_card_long']

# Selecting only the fields in 'col' as well as the ones I created
new_cols = [attr for attr in (cols + created) if attr not in yellow_long]
fifa_past = fifa_past[new_cols]
fifa_2022 = fifa_2022[new_cols]  # Reducing the dimensionality of the dataset.

fifa_past
Out[520]:
home_team away_team home_score home_xg home_penalty away_score away_xg away_penalty Score Year away_yellows home_yellows home_continent away_continent winner winner_cont
0 France Croatia 4 1.1 NaN 2 1.1 NaN 4–2 2018 1.0 2.0 Europe Europe France Europe
1 Belgium England 2 1.5 NaN 0 1.0 NaN 2–0 2018 2.0 1.0 Europe Europe Belgium Europe
2 Croatia England 2 1.7 NaN 1 0.6 NaN 2–1 2018 1.0 2.0 Europe Europe Croatia Europe
3 France Belgium 1 1.7 NaN 0 0.4 NaN 1–0 2018 3.0 2.0 Europe Europe France Europe
4 Sweden England 0 0.5 NaN 2 1.0 NaN 0–2 2018 1.0 2.0 Europe Europe England Europe
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
895 Argentina France 1 NaN NaN 0 NaN NaN 1–0 1930 0.0 0.0 South America Europe Argentina South America
896 Yugoslavia Brazil 2 NaN NaN 1 NaN NaN 2–1 1930 0.0 0.0 Europe South America Yugoslavia Europe
897 Romania Peru 3 NaN NaN 1 NaN NaN 3–1 1930 0.0 0.0 Europe South America Romania Europe
898 United States Belgium 3 NaN NaN 0 NaN NaN 3–0 1930 0.0 0.0 North America Europe United States North America
899 France Mexico 4 NaN NaN 1 NaN NaN 4–1 1930 0.0 0.0 Europe North America France Europe

900 rows × 16 columns

In [521]:
fifa_2022
Out[521]:
home_team away_team home_score home_xg home_penalty away_score away_xg away_penalty Score Year away_yellows home_yellows home_continent away_continent winner winner_cont
0 Qatar Ecuador 0.0 0.3 NaN 2.0 1.2 NaN 0–2 2022 2.0 4.0 Asia South America Ecuador South America
1 England IR Iran 6.0 2.1 NaN 2.0 1.4 NaN 6–2 2022 2.0 0.0 Europe Asia England Europe
2 United States Wales 1.0 0.8 NaN 1.0 1.5 NaN 1–1 2022 2.0 4.0 North America Europe DRAW DRAW
3 Senegal Netherlands 0.0 0.9 NaN 2.0 0.7 NaN 0–2 2022 1.0 2.0 Africa Europe Netherlands Europe
4 France Australia 4.0 4.0 NaN 1.0 0.5 NaN 4–1 2022 3.0 0.0 Europe Oceania France Europe
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
59 Morocco Portugal 1.0 1.4 NaN 0.0 0.9 NaN 1–0 2022 1.0 2.0 Africa Europe Morocco Africa
60 Argentina Croatia 3.0 2.3 NaN 0.0 0.5 NaN 3–0 2022 2.0 2.0 South America Europe Argentina South America
61 France Morocco 2.0 1.9 NaN 0.0 0.9 NaN 2–0 2022 1.0 0.0 Europe Africa France Europe
62 Croatia Morocco NaN NaN NaN NaN NaN NaN NaN 2022 0.0 0.0 Europe Africa Morocco Africa
63 Argentina France NaN NaN NaN NaN NaN NaN NaN 2022 0.0 0.0 South America Europe France Europe

64 rows × 16 columns

Part 5: Combining the Datasets¶

Now that all the cleansing has been done to both datasets, it's time to combine them. This is done easily with a function from pandas called concat. Taking two dataframes as parameters, it returns one that has the data of both of them.

I also take this opportunity to add one more field: yellows. This field sums the number of yellow cards received by the home and away teams. It is useful for analyzing the number of yellow cards given in a single game.

In [522]:
# Combined dataset for the Fifa World Cups
fifa = pd.concat([fifa_2022[pd.notnull(fifa_2022['home_score'])], fifa_past]).reset_index(drop=True)
fifa['yellows'] = fifa['home_yellows'] + fifa['away_yellows']
fifa
Out[522]:
home_team away_team home_score home_xg home_penalty away_score away_xg away_penalty Score Year away_yellows home_yellows home_continent away_continent winner winner_cont yellows
0 Qatar Ecuador 0.0 0.3 NaN 2.0 1.2 NaN 0–2 2022 2.0 4.0 Asia South America Ecuador South America 6.0
1 England IR Iran 6.0 2.1 NaN 2.0 1.4 NaN 6–2 2022 2.0 0.0 Europe Asia England Europe 2.0
2 United States Wales 1.0 0.8 NaN 1.0 1.5 NaN 1–1 2022 2.0 4.0 North America Europe DRAW DRAW 6.0
3 Senegal Netherlands 0.0 0.9 NaN 2.0 0.7 NaN 0–2 2022 1.0 2.0 Africa Europe Netherlands Europe 3.0
4 France Australia 4.0 4.0 NaN 1.0 0.5 NaN 4–1 2022 3.0 0.0 Europe Oceania France Europe 3.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
957 Argentina France 1.0 NaN NaN 0.0 NaN NaN 1–0 1930 0.0 0.0 South America Europe Argentina South America 0.0
958 Yugoslavia Brazil 2.0 NaN NaN 1.0 NaN NaN 2–1 1930 0.0 0.0 Europe South America Yugoslavia Europe 0.0
959 Romania Peru 3.0 NaN NaN 1.0 NaN NaN 3–1 1930 0.0 0.0 Europe South America Romania Europe 0.0
960 United States Belgium 3.0 NaN NaN 0.0 NaN NaN 3–0 1930 0.0 0.0 North America Europe United States North America 0.0
961 France Mexico 4.0 NaN NaN 1.0 NaN NaN 4–1 1930 0.0 0.0 Europe North America France Europe 0.0

962 rows × 17 columns

3. Exploratory data analysis¶

  1. continent winning most games TODO
  2. the residual thing (mean xg - mean goals made fr)

Part 1: Summary Statistics¶

All the cleaning is over and done with. Now, we can start with data analysis. The first thing I would like to do is example the summary statistics of a few fields. Summary statistics are useful for getting a glance at the distribution of the data without creating any plots first.

Namely, we will look at summary stats for home_score, away_score, and yellows. This process is made simple by using the numpy provided functions to calculate mean, median, minimum, maximum, and standard deviation.

To learn more about summary statistics and their uses, check out this link on Summary statistics.

In [513]:
# Summary Stats for Home Score
mean = np.mean(fifa['home_score'])
median = np.median(fifa['home_score'])
mini = np.min(fifa['home_score'])
maxi = np.max(fifa['home_score'])
stddev = np.std(fifa['home_score'])

print('Summary Stats for HOME SCORE')
print('----------------------------------')
print('Mean: ', mean, '\nStd Dev: ', stddev, '\nMedian: ', median, '\nMin: ', mini, '\nMax: ', maxi)
Summary Stats for HOME SCORE
----------------------------------
Mean:  1.7754677754677755 
Std Dev:  1.597002467430215 
Median:  1.0 
Min:  0.0 
Max:  10.0
In [514]:
# Summary Stats for Away Score
mean = np.mean(fifa['away_score'])
median = np.median(fifa['away_score'])
mini = np.min(fifa['away_score'])
maxi = np.max(fifa['away_score'])
stddev = np.std(fifa['away_score'])

print('Summary Stats for AWAY SCORE')
print('----------------------------------')
print('Mean: ', mean, '\nStd Dev: ', stddev, '\nMedian: ', median, '\nMin: ', mini, '\nMax: ', maxi)
Summary Stats for AWAY SCORE
----------------------------------
Mean:  1.0426195426195426 
Std Dev:  1.0699314095569636 
Median:  1.0 
Min:  0.0 
Max:  7.0
In [515]:
# Summary Stats for Yellows Combined
mean = np.mean(fifa['away_yellows'] + fifa['home_yellows'])
median = np.median(fifa['away_yellows'] + fifa['home_yellows'])
mini = np.min(fifa['away_yellows'] + fifa['home_yellows'])
maxi = np.max(fifa['away_yellows'] + fifa['home_yellows'])
stddev = np.std(fifa['away_yellows'] + fifa['home_yellows'])

print('Summary Stats for YELLOWS COMBINED')
print('----------------------------------')
print('Mean: ', mean, '\nStd Dev: ', stddev, '\nMedian: ', median, '\nMin: ', mini, '\nMax: ', maxi)
Summary Stats for YELLOWS COMBINED
----------------------------------
Mean:  2.6715176715176714 
Std Dev:  2.347688646183192 
Median:  2.0 
Min:  0.0 
Max:  17.0

Part 2: Visualizing Distributions¶

For this next section, we will be visualising the data using plots, courtesy of the matplotlib library. Each plot will seek to answer a question involving one or more fields in the dataset. Our goal here is to see if we notice any trends.

Question 1: How have scores changed over time?¶

This first question seeks to know whether or not scores have changed over time. Preliminarily, I will use a scatter plot to display this data. The x axis will be the year of the game, and the y axis will have the number of points scored in the game. This information will be split into two plots, one for the home team, and one for the away team. Having two graphs side by side is something that can be done using the very helpful function subplots.

Using subplots is more advantageous than using subplot since we can standardize the length of the y axis across both plots. Normally, a plot will have a y axis that is approximately the range of the data. Since the range of points scored in this case is different between teams, it would be misleading to see the two graphs with different y ranges. Using subplots fixes that.

In [516]:
# Graphing year by points scored, using helpful function subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5), sharey=True) 
fig.suptitle('Scatter Plot of Home Scores/Away Scores over Time')
ax1.scatter(fifa['Year'], fifa['home_score'])
ax2.scatter(fifa['Year'], fifa['away_score'])

# Labelling each plot
ax1.set_xlabel("Year")
ax1.set_ylabel("Points Scored")
ax1.set_title("Home")

ax2.set_xlabel("Year")
ax2.set_ylabel("Points Scored")
ax2.set_title("Away")

plt.show()

It looks like a scatter plot is not the best plot to use for the data. As you can see, the values of home_score and away_score are always whole numbers. It's also impossible to tell which amount of points scored has the highest concentration. To improve this idea, we can use another plot: the Violin Plot

Question 1 (Cont)¶

The above idea had prospect, but a scatter plot is not the best plot to use here. Instead, a violin plot works similarly, yet gives more information. It displays the distribution of the variable in the y-axis for each value of the variable in the x-axis. A violin plot also provides the mean value for the distribution as a horizontal line through the 'violin'.

In [517]:
# For each year, we need to collect the scores of that year into an array (one for home and one for away teams)
years = sorted(list(fifa['Year'].unique()))
home_scores = []
away_scores = []

# For each year, append a list of the scores for each team
for y in years:
    home_scores.append( list(fifa[fifa['Year'] == y]['home_score']) )
    away_scores.append( list(fifa[fifa['Year'] == y]['away_score']) )

# Using subplots to create two plots, side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 6), sharey=True)
fig.suptitle('Distribution of Home Scores/Away Scores over Time')
ax1.violinplot(home_scores, years, widths=4,showmeans=True)
ax2.violinplot(away_scores, years, widths=4,showmeans=True)

# Labelling
ax1.set_xlabel("Year")
ax1.set_ylabel("Points Scored")
ax1.set_title("Home")

ax2.set_xlabel("Year")
ax2.set_ylabel("Points Scored")
ax2.set_title("Away")

plt.show()

With a violin plot, it is much clearer how the data is distributed each year. It seems that there is a different between the distributions for teams that are listed as home or away. Perhaps it's the case that even being labelled as an away team has a negative effect on performance? It also seems that home teams have scored less and less over time.

You might also notice that there is a gap in data between 1940 and 1950. You may not have thought about it before now, but those years line up with the timeline of World War II. It then makes a lot of sense that we weren't holding a world tournament at a time like that!

Question 2: What is the distribution of yellow cards over the years like?¶

For this question, a violin plot is also very effective. We want to see the distribution of yellow cards received each year, since their invention. Yellow (and red) cards started being used in the 1970 World Cup, although a similar mechanism of eliminating players was used in the 1950 tournament, for example. See this wikipedia article for more information.

For this question, it might be more effective to examine total yellow cards given in a game, rather than per home or away team.

In [524]:
# List of yellows
yellows = []

# For each year...
for y in years:
    curr = fifa[fifa['Year'] == y]
    yellows.append( list(curr['yellows']) ) # Append a list of the number of yellow cards given that year
    
fig, ax = plt.subplots(figsize=(15, 7))
ax.violinplot(yellows, years, widths=4,showmeans=True)  # Creating the violin plot

# Labelling
ax.set_xlabel("Year")
ax.set_ylabel("Yellow Cards Given")
ax.set_title("Distribution of Yellow Cards over Time")

plt.show()

It is quite apparent that the maximum number of yellow cards given in games, since the conception of the idea, has drastically increased over time. From a minimum of 0 to a maximum of 17 in a single game! That number belongs to the heated game between the Netherlands and Argentina.

It is also true that the mean number of yellow cards given has increased over time (according to the horizontal line of each violin), though not perfectly.

Question 3: What group of continents has the most winners?¶

Using the winner_cont field, we can count the number of wins for each continent over the entire history of the World Cup. For this, we will be removing every game that was a DRAW.

I prefer using a horizontal bar graph here because of the conservation of space. Since the counts are quite large, it's a better idea to have them extend sideways than upwards.

In [572]:
# Winners are counted per continent, and converted to a dictionary
winners = fifa.groupby(by='winner_cont')['winner'].agg('count').to_dict()
winners.pop('DRAW')  # removing games that ended with a DRAW

# Keys: number of wins
# Values: the continents themselves
x_axis = list(winners.keys())
y_axis = list(winners.values())
clrs = ['#59bfff', '#bfe6ff', '#0da2ff', '#8cd3ff', '#d9f1ff', '#26abff']  # a range of blues

# Plotting with a horizonal bar graph
fig, ax = plt.subplots(figsize=(10, 5))
bars = ax.barh(x_axis, y_axis, color=clrs)

# Labelling
ax.set_xlabel("Games Won")
ax.set_ylabel("Continent")
ax.set_title("Bar Graph of Wins per Continent")

plt.show()
Question 4: How have the lead teams of 2022 done in the past?¶

Now that the semi finals are underway, the top 4 teams of this years World Cup are: Morocco and Croatia (competing for third place), France, and Argentina. I will create a line graph to depict the number of games these teams won in the past years. This is to get a sense for whether it may have been unexpected for these teams to make it this far, or if it was likely based on past games.

In [616]:
# Leading 4 teams of the 2022 World Cup
leads = ['Morocco', 'Argentina', 'Croatia', 'France']
plt.figure(figsize=(15, 7))

# For each team
for l in leads:
    
    # A dataframe of the wins for a specific team
    curr = fifa_past[fifa_past['winner'] == l].copy()
    
    # Grouping by year to count games won
    per_year = curr.groupby(by='Year')['winner_cont'].agg('count')

    plt.plot(per_year, label = l)  # Plotting, with a label for each line
    plt.legend()    
    
# Labelling
plt.title('Wins by the 2022 Top 4 Teams from 1930-2018')
plt.xlabel('Year')
plt.ylabel('Games Won') 
plt.ylim([0, 8])

plt.show()

Since I am not analyzing the draws of any team, these numbers imply that the teams made it past the eliminaries and into the final 16 bracket.

It seems that Argentina and France have done well in the World Cup since very early on, and continue to do well. Croatia is also a team that does well in the World Cup historically (the country was formed in 1991, so the track record is short). Meanwhile, Morocco is a country that has not done as well throughout the tournaments, only winning one game in the round of 16. Despite this, it has managed to secure a spot as top 4 teams this year.

In [601]:
# CONTINENT SHIET
# Which continent scores most

# ax = fifa.boxplot(column=['home_score', 'away_score'], by='home_continent', figsize=(16, 7), grid=False)
# plt.suptitle('Box Plot of Home Scores/Away Scores by Continent')
# plt.show()
Question 5: How are Yellow Cards split between Continents?¶

This final segment will use pie charts and focus on the trend between continents and points earned, as well as with yellow cards given.

We start with continents v.s. yellow card distribution. This is done by grouping wins per continent and summing them up. For each team, whether they were home or away, the number of yellow cards is counted up. In honor of the fact that we are plotting information about yellow cards, the pie chart is colored with shades of yellow.

In [603]:
# Set of continent names
labels = tuple(list(fifa['home_continent'].unique()))
sizes = []  # to be the sum of yellow cards earned

# For each continent...
for l in labels:
    
    # Add the number of cards earned by a team when they were 'home' + the cards earned by that team when they were 'away'
    sizes.append( sum(fifa[fifa['home_continent'] == l]['home_yellows'] ) 
                 + sum(fifa[fifa['away_continent'] == l]['away_yellows'] ) )
    
# Plotting, labelling
fig, ax = plt.subplots(figsize=(6, 6))
fig.suptitle('Composition of Yellow Cards Given by Continent')

clrs = ['#e6cc00', '#e47200', '#e6cc00', '#e6b400', '#e69b00', '#e8e337'] # a range of yellow values

# Creating the pie chart
ax.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90, colors=clrs)
ax.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()
In [562]:
# Score by continent
scores = []

for l in labels:
    scores.append( sum(fifa[fifa['home_continent'] == l]['home_score'] ) 
                  + sum(fifa[fifa['away_continent'] == l]['away_score'] ) )
    
fig, ax = plt.subplots(figsize=(6, 6))
fig.suptitle('Composition of Points Scored by Continent')

clrs = ['#bfe6ff', '#0da2ff', '#8cd3ff', '#59bfff', '#26abff', '#d9f1ff'] # a range of blue values

ax.pie(scores, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90, colors=clrs)
ax.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()
In [604]:
# how many teams per country?
home_teams = fifa.groupby('home_continent')['home_team'].agg('count')
home_hash = home_teams.to_dict()

away_teams = fifa.groupby('away_continent')['away_team'].agg('count')
away_hash = away_teams.to_dict()
In [605]:
# Yellows by Continent with respect to number of teams
# Pie chart:
labels = tuple(list(fifa['home_continent'].unique()))
sizes = []

for l in labels:
    sizes.append( ( sum(fifa[fifa['home_continent'] == l]['home_yellows']) / home_hash[l] )
                 + ( sum(fifa[fifa['away_continent'] == l]['away_yellows']) / away_hash[l] ) )
    

fig, ax = plt.subplots(figsize=(6, 6))
fig.suptitle('Composition of Yellow Cards Given by Continent per Team on Average')

clrs = ['#e6b400',  '#e8e337', '#e6cc00', '#e47200','#e5de00', '#e69b00'] # a range of yellow values

ax.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90, colors=clrs)
ax.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()
In [606]:
# Score by continent with respect to number of teams
scores = []

for l in labels:
    scores.append( ( sum(fifa[fifa['home_continent'] == l]['home_score']) / home_hash[l] )
                 + ( sum(fifa[fifa['away_continent'] == l]['away_score']) / away_hash[l] ) )
    
fig, ax = plt.subplots(figsize=(6, 6))
fig.suptitle('Composition of Points Scored by Continent per Team on Average')

clrs = ['#bfe6ff', '#26abff', '#8cd3ff', '#59bfff',  '#0da2ff', '#d9f1ff'] # a range of blue values

ax.pie(scores, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90, colors=clrs)
ax.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()

4. Hypothesis testing¶

  • relationship between year and yellow cards? start from 1950
  • is home xg dependent on away cont?? AND VICE VERSE
In [ ]:
fifa_1950 = fifa[fifa['Year'] >= 1950]
fifa_1950
In [ ]:
# yellows depend on year? answer YES see p val < 0.05
summ = smf.ols(formula = 'yellows ~ Year', data=fifa_1950).fit()
print(summ.summary())
In [ ]:
X = np.array(fifa_1950['Year']).reshape((-1, 1))
y = fifa_1950['yellows']

# Fitting a model using X and y.
reg = LinearRegression().fit(X, y)
pred = reg.predict(X)

# # Extracting the coefficient and intercept to give the formula (y = mx + b) of the model.
intercept = reg.intercept_
coef = reg.coef_[0]

# The formula for the model is printed below.
print('Model: y =', coef, 'x +', intercept)
In [ ]:
years = list(fifa_1950['Year'].unique())
plt.figure(figsize=(13, 9))
plt.scatter(fifa_1950['Year'], fifa_1950['yellows'], s=10)
plt.plot(X, pred, color='r')

plt.title('Yellow Cards Earned v.s. Year')
plt.xlabel('Year')
plt.ylabel('Yellow Cards Given')   

plt.show()
In [ ]:
# does my score depend on my continent
home_home = smf.ols(formula = 'home_score ~ home_continent', data=fifa).fit()
print(home_home.summary())
In [ ]:
# does my score depend on opponents continent
home_away = smf.ols(formula = 'home_score ~ away_continent', data=fifa).fit()
print(home_away.summary())
In [ ]:
# does my xg depend on my opponents continent?
home_xg = smf.ols(formula = 'home_xg ~ away_continent', data=fifa).fit()
print(home_xg.summary())

5. Communication of insights attained¶

In [ ]: